Skip to content

Provide Dockerfile for midstream MPI CUDA image#656

Merged
openshift-merge-bot[bot] merged 1 commit into
opendatahub-io:mainfrom
sutaakar:odh-mpi
Mar 6, 2026
Merged

Provide Dockerfile for midstream MPI CUDA image#656
openshift-merge-bot[bot] merged 1 commit into
opendatahub-io:mainfrom
sutaakar:odh-mpi

Conversation

@sutaakar

@sutaakar sutaakar commented Mar 5, 2026

Copy link
Copy Markdown
Contributor

Description

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • New Features

    • New training container image supporting Python 3.12, CUDA 13.0, PyTorch 2.9, and OpenMPI 4.1 with GPU-accelerated, multi-node training and SSH access.
    • Runtime wrappers and environment setup ensure the container runs as the expected non-root UID, exposes correct PATH/LD_LIBRARY_PATH, and launches with a UID-aware entrypoint.
  • Bug Fixes

    • Resolved library compatibility by pinning and reinstalling key Python and native packages (numpy, scipy, pyarrow, pillow) and restoring Python runtime stability.

@sutaakar sutaakar requested a review from kapil27 March 5, 2026 14:45
@coderabbitai

coderabbitai Bot commented Mar 5, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

Adds a new Dockerfile building a CUDA 13.0 + Python 3.12 + PyTorch 2.9.0 training image. Installs runtime libraries and UCX RPMs, builds OpenMPI with CUDA/UCX support, configures SSH client/server and environment wrappers, installs Python deps via micropipenv plus targeted pins (numpy, scipy, pyarrow, pillow, s3fs), and introduces an uid_entrypoint script and permission changes for OpenShift UID/GID handling. Image runs as user 1001 by default and sets LD_LIBRARY_PATH, PATH, and NVIDIA_REQUIRE_CUDA env vars.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Security findings

  • SSH host-key verification disabled — likely StrictHostKeyChecking=no. Risk: MITM (CWE-295). Action: remove disabling, use known_hosts or explicit host keys, or require SSH_KNOWN_HOSTS injection at runtime.

  • Group-writable filesystem/permission changes (g=u and group-writable /etc/passwd handling) — incorrect permission assignment (CWE-276) and improper privilege management (CWE-269). Action: restrict writable scopes to minimal files; avoid making /etc/passwd or code paths group-writable; ensure UID registration uses secure APIs and limited writable temp locations.

  • LD_LIBRARY_PATH set to include OpenMPI/UCX libs — untrusted search path risk (CWE-426). Action: prefer rpath in built binaries or set library paths via secure, non-user-writable locations; ensure no user-writable directories precede system libs.

  • uid_entrypoint.sh modifies /etc/passwd at container start — risk of privilege escalation, TOCTOU, and injection (CWE-269, CWE-367). Action: validate and sanitize UIDs/GIDs, use atomic operations, restrict inputs, and drop privileges before modifying persistent system files if possible.

  • SSH server/client configured for environment forwarding and custom port in container — increases attack surface. Action: limit allowed environment variables, disable agent forwarding unless required, and bind SSH only when needed; require explicit runtime enablement.

  • Prebuilt UCX/OpenMPI RPMs and other native packages installed — supply-chain risk. Action: verify package signatures/checksums (compare SHA256), pin versions, and document provenance; scan binaries for known CVEs.

  • Pinned Python native packages (numpy/scipy/pyarrow/pillow) — potential CVE exposure. Action: run SCA (CVE) checks against pinned versions and prefer wheels built against the same base libraries; consider rebuilding against base glibc/libjpeg/libpng versions in image.

Address each finding before approving; provide fixes as minimal, testable changes (configurable known_hosts, reduced writable scope, validated uid handling, verified RPM signatures).

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: introducing a Dockerfile for a midstream MPI CUDA training image.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile`:
- Around line 120-123: The RUN block that uses sed/echo to modify
/etc/ssh/sshd_config (the lines that set "StrictModes" and "Port") weakens SSH
trust by forcing StrictModes no and appending duplicate entries; revert this by
removing the echo that forces "StrictModes no" and ensure sshd_config keeps
StrictModes enabled (do not set StrictModes to "no"), stop appending duplicate
Port/StrictModes lines (only uncomment existing directives with sed), and add
steps to enforce secure filesystem permissions for user homes and SSH keys
(ensure user home directories are not group/world-writable and ~/.ssh is 700
with private keys 600) so the Dockerfile RUN block that edits sshd_config and
sets permissions secures SSH correctly.
- Line 154: The Dockerfile currently grants group-write to installed Python
libraries via the line that runs chmod -R g+w on
/opt/app-root/lib/python3.12/site-packages, which allows unprivileged processes
to modify site-packages; remove that chmod and instead ensure site-packages
ownership and permissions are locked down (e.g., set owner to root and remove
group/other write bits) so only privileged users can modify libraries; locate
the chmod invocation in the Dockerfile (the line touching
/opt/app-root/lib/python3.12/site-packages) and replace it with steps that set
secure ownership (chown root:root) and restrictive permissions (remove g+w and
o+w) for the site-packages directory.
- Around line 114-117: The RUN step in the Dockerfile currently disables SSH
host key verification (the sed edit to /etc/ssh/ssh_config and adding
"UserKnownHostsFile /dev/null"), which must be removed; instead revert the sed
change so StrictHostKeyChecking remains enabled, stop writing UserKnownHostsFile
/dev/null, and populate a trusted host key store (e.g., /etc/ssh/ssh_known_hosts
or per-container ~/.ssh/known_hosts) using ssh-keyscan for cluster peers at
build or startup; ensure the SSH_PORT setting is preserved but do not discard
known_hosts—update the RUN step that edits /etc/ssh/ssh_config and add a step
that gathers and installs host keys before starting SSH to maintain
verification.
- Line 1: The Dockerfile currently installs compilers and -devel build tools
inside a single RUN layer, leaving artifacts in image history; refactor to a
multi-stage build by adding a separate builder stage (e.g., start a new FROM ...
AS builder), move the RUN that installs gcc, gcc-c++, make, perl, and -devel
packages and all build/compile steps into that builder stage, then create a
final stage from the original base image and COPY only the built artifacts
(binaries, wheels, libraries) from the builder stage into the final image;
remove any installation/removal commands from the final stage so build tools are
never present in final layers (reference the existing RUN that installs and
removes build tools in the Dockerfile and replace it with builder-stage steps
and a minimal final stage).
- Around line 32-41: The Dockerfile currently downloads and installs UCX
tarballs/rpms directly (using UCX_VERSION in the RUN block) without integrity
checks; update the UCX download/install steps (and the OpenMPI download lines
noted at 61–63) to verify cryptographic integrity before installation: fetch the
corresponding checksum (e.g., .sha256 or .sha512) and/or signature (.asc),
import and trust the publisher GPG key, verify the signature with gpg or compare
the computed checksum against the published value, and only proceed to
extract/rpm -ivh if verification succeeds; ensure failures abort the build and
remove unverified artifacts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 0b5b0db2-fa40-484f-b8c8-c82ba01ff8e1

📥 Commits

Reviewing files that changed from the base of the PR and between 45ec38c and e7db758.

📒 Files selected for processing (1)
  • images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile

Comment thread images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile
Comment thread images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile Outdated
Comment thread images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile
Comment thread images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile
Comment thread images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile (1)

1-1: ⚠️ Potential issue | 🟠 Major

Single-stage build still violates the mandatory multi-stage policy.

Severity: Major (CWE-250).
Exploit scenario: build compilers/dev tooling are removed from the final filesystem, but remain recoverable from layer history; a registry/layer reader can extract and reuse those artifacts, increasing post-compromise capability.

Remediation scaffold
-FROM quay.io/opendatahub/odh-midstream-cuda-base-13-0:268a2d4baec5ed3c3ae09a6cce325fb83622d87b
+FROM quay.io/opendatahub/odh-midstream-cuda-base-13-0:268a2d4baec5ed3c3ae09a6cce325fb83622d87b AS builder
@@
-RUN dnf install -y \
+RUN dnf install -y \
         rdma-core libibverbs librdmacm libibumad libmlx5 infiniband-diags \
         hwloc libevent pmix \
         gcc gcc-c++ make perl \
         rdma-core-devel libibverbs-devel librdmacm-devel \
         hwloc-devel libevent-devel pmix-devel zlib-devel \
@@
-    && dnf remove -y \
-        gcc gcc-c++ make perl \
-        rdma-core-devel libibverbs-devel librdmacm-devel \
-        hwloc-devel libevent-devel pmix-devel zlib-devel \
-    && dnf clean all \
+    && dnf clean all \
     && rm -rf /tmp/openmpi-${OPENMPI_VERSION} /tmp/openmpi.tar.gz
+
+FROM quay.io/opendatahub/odh-midstream-cuda-base-13-0:268a2d4baec5ed3c3ae09a6cce325fb83622d87b AS runtime
+RUN dnf install -y openssh-server libjpeg-turbo libpng libwebp rdma-core libibverbs librdmacm libibumad libmlx5 infiniband-diags hwloc libevent pmix && dnf clean all
+COPY --from=builder /usr/lib64/openmpi /usr/lib64/openmpi
+COPY --from=builder /usr/lib64/ucx /usr/lib64/ucx
+COPY --from=builder /usr/bin/mpiexec /usr/bin/mpiexec
+COPY --from=builder /usr/bin/orted /usr/bin/orted

As per coding guidelines, **/Dockerfile*: DOCKERFILE SECURITY (CWE-798, CWE-250): 4. Use multi-stage builds.

Also applies to: 56-89

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile` at line
1, The Dockerfile currently uses a single-stage build (base image:
quay.io/opendatahub/odh-midstream-cuda-base-13-0:268a2d4baec5ed3c3ae09a6cce325fb83622d87b)
which violates the multi-stage build policy; refactor it into a multi-stage
Dockerfile by adding a builder stage that contains compilers and dev tooling to
perform all build steps, then COPY only the produced runtime artifacts into a
minimal final stage (remove build tools from the final image), ensuring final
stage starts FROM a slim/runtime base and only copies necessary files from the
builder stage to eliminate recoverable build artifacts in image layers (update
the Dockerfile references and stage names accordingly).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile`:
- Around line 39-45: The Dockerfile currently installs and removes UCX RPMs
using rpm with the --nodeps flag (the rpm -ivh ... /tmp/ucx-*.rpm sequence and
the corresponding rpm -e removal commands), which bypasses dependency checks;
remove the --nodeps option from those rpm invocations so package dependencies
are validated during install/uninstall (ensure any required dependency packages
are added to the Dockerfile or handled via dnf/yum to satisfy them before
installing the UCX RPMs).
- Around line 135-140: The user sshd config (.sshd_config) sets a HostKey that
is never created, so update the container startup (e.g., uid_entrypoint.sh) to
ensure the user .ssh directory exists with correct permissions and to generate a
host key if missing: create the .ssh dir, chmod 700, check for the id_rsa host
key, and if absent run ssh-keygen to create an RSA host key and chmod 600 the
key before launching sshd; ensure this runs prior to starting sshd so the
HostKey entry in .sshd_config points to an existing key.
- Line 143: The Dockerfile currently force-upgrades an unpinned micropipenv (RUN
pip install -U "micropipenv[toml]") and separately installs s3fs outside the
Pipfile.lock; update this by pinning micropipenv to a specific vetted version
(remove the -U behavior) and ensure s3fs is added to the Pipfile.lock (with
hashes) so it is installed via micropipenv's lockfile rather than an ad-hoc pip
install; alternatively, if you must pip-install s3fs in the Dockerfile, install
a pinned s3fs==<version> with a verified hash and remove any --no-deps installs
that bypass lockfile resolution to prevent arbitrary remote wheel execution.

---

Duplicate comments:
In `@images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile`:
- Line 1: The Dockerfile currently uses a single-stage build (base image:
quay.io/opendatahub/odh-midstream-cuda-base-13-0:268a2d4baec5ed3c3ae09a6cce325fb83622d87b)
which violates the multi-stage build policy; refactor it into a multi-stage
Dockerfile by adding a builder stage that contains compilers and dev tooling to
perform all build steps, then COPY only the produced runtime artifacts into a
minimal final stage (remove build tools from the final image), ensuring final
stage starts FROM a slim/runtime base and only copies necessary files from the
builder stage to eliminate recoverable build artifacts in image layers (update
the Dockerfile references and stage names accordingly).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: b9e7dcf5-4c1d-475c-97ef-f8936d299750

📥 Commits

Reviewing files that changed from the base of the PR and between e7db758 and 7622c33.

📒 Files selected for processing (1)
  • images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile

Comment thread images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile
Comment thread images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile
Comment thread images/runtime/training/py312-cuda130-torch29-openmpi41/Dockerfile

@kapil27 kapil27 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm thanks @sutaakar

@sutaakar

sutaakar commented Mar 6, 2026

Copy link
Copy Markdown
Contributor Author

/approve

@openshift-ci

openshift-ci Bot commented Mar 6, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kapil27, sutaakar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label Mar 6, 2026
@openshift-merge-bot openshift-merge-bot Bot merged commit 5d19b52 into opendatahub-io:main Mar 6, 2026
4 checks passed
@sutaakar sutaakar deleted the odh-mpi branch March 6, 2026 09:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants